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COUPLING HIDDEN MARKOV MODELS FOR THE DISCOVERY 
OF CIS-REGULATORY MODULES IN MULTIPLE SPECIES 

By Qing Zhou and Wing Hung Wong 1 

UCLA and Stanford University 

Cis-regulatory modules (CRMs) composed of multiple transcrip- 
tion factor binding sites (TFBSs) control gene expression in eukary- 
otic genomes. Comparative genomic studies have shown that these 
regulatory elements are more conserved across species due to evo- 
lutionary constraints. We propose a statistical method to combine 
module structure and cross-species orthology in de novo motif dis- 
covery. We use a hidden Markov model (HMM) to capture the module 
structure in each species and couple these HMMs through multiple- 
species alignment. Evolutionary models are incorporated to consider 
correlated structures among aligned sequence positions across differ- 
ent species. Based on our model, we develop a Markov chain Monte 
Carlo approach, MultiModule, to discover CRMs and their compo- 
nent motifs simultaneously in groups of orthologous sequences from 
multiple species. Our method is tested on both simulated and biolog- 
ical data sets in mammals and Drosophila, where significant improve- 
ment over other motif and module discovery methods is observed. 

1. Introduction. Gene transcription is regulated by interactions between 
transcription factors and their binding sites on DNA. The analysis of ge- 
nomic sequences for short sequence elements (czs-regulatory elements) that 
mediate such interactions is an important problem in computational biol- 
ogy. In this paper we develop a method for predicting cis-regulatory elements 
based on the statistical modeling of combinatorial control by multiple tran- 
scription factors, and of cross-species conservation of the regulatory roles of 
these factors. The remaining part of this Introduction provides a review of 
relevant literature and background of our approach. Section 2 presents our 
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statistical model. Section 3 develops the computational algorithms for infer- 
ence from this model. Sections 4 and 5 present evidence for the effectiveness 
of the proposed method through both simulation studies and applications 
to genomic data. Section 6 contains concluding remarks and discussions of 
future work. The last section is an Appendix on mathematical details. 

A transcription factor (TF) recognizes and binds to many different sites 
in the genome. The sites for a TF are usually not identical but share simi- 
larity that can be summarized by a statistical model called a motif. A motif 
is often parameterized by a position-specific weight matrix (PWM) that 
summarizes the relative frequencies for the four types of nucleotides at each 
position of the site. Computational methods for motif discovery and TFBS 
prediction were initiated by Stormo and Hartzell (1989) and further devel- 
oped in Lawrence and Reilly (1990), Lawrence et al. (1993) and Liu, Neuwald 
and Lawrence (1995). An alternative approach based on word enumeration 
instead of PWM was proposed by Bussemaker, Li and Siggia (2000), Hamp- 
son, Kibler and Baldi (2002), Liu, Brutlag and Liu (2002) and Sinha and 
Tompa (2002). 

A number of approaches have been developed recently to increase the 
accuracy of TFBS prediction based on the PWM formulation. The first 
approach utilizes the concept of cis-regulatory modules [Yuh, Bolouri and 
Davidson (1998), Loots et al. (2000)]. A cis-regulatory module is a short se- 
quence element (typically 100-500 bp in length) containing a cluster of TF- 
BSs for one or more TFs. Different TFs binding to the same cis-regulatory 
module cooperate in their control of gene transcription through mechanisms 
such as synergistic binding or sequential recruitment. Computational meth- 
ods that search for CRMs given the PWMs of several interacting TFs were 
proposed by Frith, Hansen and Weng (2001), Frith et al. (2002) and Berman 
et al. (2002). Such searches were combined with the alignment of ortholo- 
gous sequences from several related species [Sinha, van Nimwegen and Siggia 

(2003) ]. When the PWMs are unknown, a de novo module discovery method 
has been developed using a hierarchical mixture model for the CRM [Zhou 
and Wong (2004)]. This method, called CisModule, has been applied success- 
fully to predict the CRMs that control Ciona muscle development [Johnson 
et al. (2005)]. In addition, a model that considers transition probabilities 
and neighboring distances between TFBSs has been proposed in Thompson 
et al. (2004), and was further developed to identify CRMs given a collection 
of potential PWMs [Gupta and Liu (2005)]. 

Several recent methods employ a second approach, that is, multiple genome 
comparison, to enhance the power of cis-regulatory analysis. PhyloCon [Wang 
and Stormo (2003)] builds multiple alignments among orthologs and extends 
these alignments to identify motif profiles. CompareProspector [Liu et al. 

(2004) ] biases motif search to more conserved regions based on conservation 
scores. OrthoMEME [Prakash et al. (2004)] identifies pairs of orthologous 
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TFBSs in two species. With a given alignment of orthologs and a phylo- 
genetic tree, EMnEM [Moses, Chiang and Eisen (2004)], PhyME [Sinha, 
Blanchette and Tompa (2004)] and PhyloGibbs [Siddharthan, Siggia and 
Nimwegen (2005)] detect motifs based on more comprehensive evolutionary 
models for TFBSs. Finally, when evolutionary distances among the genomes 
are too large for the orthologous sequences to be reliably aligned, Li and 
Wong (2005) proposed an ortholog sampler that finds motifs in multiple 
species independent of ortholog alignments. 

Modeling CRMs enhances the performance of de novo motif discovery 
because it allows the use of information encoded by the spatial correlation 
among TFBSs in the same module. Likewise, the use of multiple genomes 
enhances motif prediction because it allows the use of information from the 
evolutionary conservation of TFBSs in related species. Although both types 
of information have been exploited separately, they have not been utilized 
simultaneously for de novo prediction for cis-regulatory elements. We be- 
lieve that an approach that utilizes both pieces of knowledge can further 
improve the power for de novo prediction. In this paper we use a hidden 
Markov model (HMM) to capture the co-localization tendency of multiple 
TFBSs within each species, and then couple the hidden states (which in- 
dicate the locations of modules and TFBSs within the modules) of these 
HMMs through multiple-species alignment. We develop evolutionary mod- 
els separately for background nucleotides and for motif binding sites in order 
to capture the different degrees of conservation among the background and 
among the binding sites. We develop a Markov chain Monte Carlo algorithm 
for sampling CRMs and their component motifs simultaneously from their 
joint posterior distribution. We test the method on both simulated and well- 
annotated biological data sets, and demonstrate that it provides significant 
improvement over other de novo motif and module discovery methods. Com- 
pared to alignment-based motif discovery methods such as PhyME [Sinha, 
Blanchette and Tompa (2004)] and PhyloGibbs [Siddharthan, Siggia and 
van Nimwegen (2005)], our approach has two unique features: (1) We con- 
sider module information through a hidden Markov model; (2) The multiple 
alignments of orthologous sequences are dynamically updated, so that the 
uncertainty in the alignments is taken into account. The advantages of these 
features are illustrated by the examples in this article. 

2. The coupled hidden Markov model. Our input data consist of up- 
stream or regulatory sequences of n (co-regulated) genes from N species, 
that is, a total of n x N sequences. Assuming these genes are regulated by 
CRMs composed of binding sites of K TFs, one wants to find these TFBSs 
and their motifs (PWMs). We only consider N closely related species in the 
sense that their orthologous TFs share the same binding motif, which ap- 
plies to groups of species within mammals, or within Drosophila, and so on. 
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Please note that our model, to be developed in this section, is applicable to 
the situation where different genes have distinct numbers of orthologs, that 
is, some orthologs are missing for some genes. For notational ease, missing 
orthologs are treated as zero-length sequences, so that we always assume n 
sequences from each species. 

2.1. The HMM for module structure. Let us first focus on the module 
structure in one sequence. We assume that the sequence is composed of 
two types of regions, modules and background. A module contains multi- 
ple TFBSs separated by background nucleotides, while background regions 
contain only background nucleotides. Accordingly, we assume that the se- 
quence is generated from a hidden Markov model with two states, a module 
state (M) and a background state (B). In a module state, the HMM either 
emits a nucleotide from the background model (of nucleotide preference) 9q , 
or it emits a binding site of one of the K motifs (PWMs) Q\,@2, ■ ■ ■ , 
The probability for emission from 9q and (k = 1,2, ... ,K) is denoted 
by qo and qk, respectively (J2k=oQk = 1) [Figure 1(A)]. Note that a module 
state can be further decomposed to K + 1 states, corresponding to within- 
module background (Mo) and K motif binding sites (Mi to Mk), that is, 
M = {Mo, Mi, . . . , Mk}- Assuming that the width of motif k is u>k, a bind- 
ing site of this motif, a piece of sequence of length Wk, is treated as one 
state of Mfc as a whole {k = 1, 2, ... , K). The transition probability from a 
background to a module state is r, that is, the chance of initiating a new 
module is r. The transition probability from a module state to a background 
state is t, that is, the expected length of a module is 1/t. We denote the 
transition matrix by 

1 — r r 
t 1-t ' 

This model can be viewed as a stochastic version of the hierarchical mixture 
model (HMx) defined in Zhou and Wong (2004). 

2.2. Coupling HMMs via multiple alignment. The HMMs in different or- 
thologs are coupled through multiple alignment, so that the hidden states 
of aligned bases in different species are collapsed into a common state [Fig- 
ure 1(B)]. For instance, the nucleotides of state 4 in the three orthologs 
are aligned in Figure 1(B). Thus, these three states are collapsed into one 
state, which determines whether these aligned nucleotides are background 
or binding sites of a motif. (Note that these aligned nucleotides in different 
orthologs are not necessarily identical.) Here hidden states refer to the de- 
composed states, that is, B and Mo to Mr-, which specify the locations of 
modules and motif sites. This coupled hidden Markov model (c-HMM here- 
after) has a natural graphical model representation [lower panel of Figure 
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Fig. 1. The coupled hidden Markov model (c-HMM). (A) The HMM for module struc- 
ture in one sequence. (B) Multiple alignment of three orthologous sequences (upper panel) 
and its corresponding graphical model representation of the c-HMM (lower panel). The 
nodes represent the hidden states. The vertical bars in the upper panel indicate that the 
nucleotides emitted from these states are aligned and thus collapsed in the lower panel. 
Note that a node will emit Wk nucleotides if the corresponding state is Mk (k= 1, . . . , K). 
(C) The evolutionary model for motifs using one base of a motif as an illustration. The 
hidden ancestral base is Z, which evolves to three descendant bases X^ and X^. 

Here the evolutionary bond between X^ and Z is broken, implying that X^ is indepen- 
dent of Z . The bond between X^ and Z and that between X^ and Z are connected, 
which means that X m = X {3) = Z. 



1(B)], in which each state is represented by a node in the graph and the 
arrows specify the dependence among them. The transition (conditional) 
probabilities for nodes with a single parental node are defined by the same 
T in (1). We define the conditional probability for a node with multiple par- 
ents as follows: If node Y has m parents, each in state Yi (i = 1,2, . . . ,m), 
then we have 

(2) P(Y\Y X , ...,Y m ) = -f^ P(Y\Y) = —T(B, Y) + ^T(M, Y), 

to f— * m m 

i=i 

where Cb and Cm are the numbers of the parents in states B and M, respec- 
tively (to = Cb + Cm)- This equation shows that the transition probability 
to a node with multiple parents is defined as the weighted average from the 
parental nodes in background states and module states. We use the same 
emission model described in the previous section for unaligned states. For 
aligned (coupled) states, we assume star-topology evolutionary models with 
one common ancestor, although the method can be readily generalized to 
a tree topology. The c-HMM first emits (hidden) ancestral nucleotides by 
the emission model defined in Figure 1(A), given the coupled hidden states. 
Then, different models are used for the evolution from the ancestral to de- 
scendant nucleotides depending on whether they are background or TFBSs. 
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2.3. The evolutionary model. A neutral substitution matrix is used for 
the evolution of aligned background nucleotides, both within and outside of 
modules, with a transition rate of a and a transversion rate of (3: 



(3) 



I -fib P 

(3 I- fib 

a (3 
(3 a 



a (3 
j3 a 
I- fib (3 

(3 I- fib 



where the rows and columns are ordered as A, C, G and T, and fib = a + 2(3 
is defined as the background mutation rate. We assume an independent evo- 
lution for each position (column) of a motif under the nucleotide substitution 
model of Felsenstein (1981), which was also used in Sinha, van Nimwegen 
and Siggia (2003). Suppose the weight vector of a particular position in the 
motif is 9. The ancestral nucleotide, denoted by Z, is assumed to follow a 
discrete distribution with the probability vector 9 on {A,C,G,T}. If X is 
a corresponding nucleotide in a descendant species, then either X inherits 
Z directly (with probability fij) or it is generated independently from the 
same weight vector 9 (with probability 1 — fif). The parameter fit, which 
is identical for all the positions within a motif, reflects the mutation rate of 
the TFBSs. This model takes PWM into account in the binding site evolu- 
tion, which agrees with the nonneutral constraint of TFBSs that they are 
recognized by the same protein (TF). It is obvious that, under this model, 
the marginal distribution of any motif column is identical in all the species. 
This evolutionary model introduces another hidden variable which indicates 
whether X is identical to or independent of Z for each base of an aligned 
TFBS. We call these indicators evolutionary bonds between ancestral and 
descendant bases [Figure 1(C)]. If X = Z, we say that the bond is connected; 
if X is independent of Z, we say that the bond is broken. 



3. Gibbs sampling and Bayesian inference. 



3.1. Basic framework. Our full model involves the following parameters: 
the transition matrix T defined in equation 1, the mixture emission proba- 
bilities <7oj<Zi) • • • ,1k, the motif widths w±, . . . , wk , the PWMs &±, . . . ,@k, 
the background models for ancestral nucleotides and all current species, and 
the evolutionary parameters a, (3 and fif. We take as input the number of 
TFs, K and the expected module length, L, and fix the transition probabil- 
ity t = 1/L in T. Compared to the HMx model in Zhou and Wong (2004), 
this model has three extra free parameters, a, j3 and fif, related to the evo- 
lutionary models. Independent Poisson priors are put on motif widths and 
flat Dirichlet distributions are used as priors for all the other parameters. 
With a given alignment for each ortholog group, we treat as missing data 
the locations of modules and motifs (i.e. the hidden states), the ancestral 
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sequences and the evolutionary bonds. We develop a Gibbs sampler (called 
MultiModule, hereafter) to sample from the joint posterior distribution of 
all the unknown parameters and missing data. To consider the uncertainty 
in multiple alignment, we adopt an HMM-based multiple alignment [Baldi et 
al. (1994), Krogh et al. (1994)] conditional on the current parameter values. 
This is achieved by adding a Metropolis-Hastings step in the Gibbs sampler 
to update these alignments dynamically according to the current sampled 
parameters, especially the background substitution matrix $ [equation (3)]. 
In summary, the input data of MultiModule are groups of orthologous se- 
quences, and the program builds an initial alignment of each ortholog group 
by a standard HMM-based multiple alignment algorithm. Then each itera- 
tion of MultiModule is composed of three steps: (1) Given alignments and 
all the other missing data, we update motif widths and other parameters by 
their conditional posterior distributions; (2) Given current parameters, with 
probability u, we update the alignment of each ortholog group; (3) Given 
alignments and parameters, a dynamic programming approach is used to 
sample module and motif locations, ancestral sequences and evolutionary 
bonds. (See the Appendix for the details of the Gibbs sampling of Multi- 
Module.) The probability u is typically chosen in the range [0.1,0.3]. 

Motif and module predictions are based on their marginal posterior dis- 
tributions constructed by the samples generated by MultiModule after some 
burn- in period (usually the first 50% of iterations). We estimate the width 
of each motif by its rounded posterior mean. We record the following pos- 
terior probabilities for each sequence position in all the species: (1) Pk, the 
probability that the position is within a site for motif k, that is, the hid- 
den state is (k = 1, 2, . . . , K); (2) P m , the probability that the position 
is within a module, that is, the hidden state is M; (3) P a , the probability 
that the position is aligned. All the contiguous segments with P^ > 0.5 are 
aligned (and extended if necessary) to generate predicted sites of motif k 
given the estimated width w^. The corresponding average P a over the bases 
of a predicted site is reported as a measure of its conservation. We collect 
all the contiguous regions with P m > 0.5 as candidates for modules, and 
a module is predicted if the region contains at least two predicted motif 
binding sites. The boundary of a predicted module is defined by the first 
and last predicted binding sites it contains. We use 0.5 as the threshold for 
posterior probabilities. This threshold determines the trade-off between the 
sensitivity and the specificity of the predictions. In our experience, values 
in the range of [0.5,0.7] for the threshold usually give good performance 
for the posterior inference on the model. Smaller threshold often results in 
many false positive predictions. 

Under the c-HMM, if we fix r = 1 — t = 1 in the transition matrix T [equa- 
tion (1)], then MultiModule reduces to a motif discovery method, assuming 
the existence of K motifs in the sequences. This setting is useful when the 
motifs do not form modules, and we call it the motif mode of MultiModule. 
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3.2. Multimodality and combined prediction. Although MultiModule con- 
verges to the target posterior distribution eventually, from the examples in 
Section 5 we find that it usually reaches some local mode quickly and then 
moves around the mode for a long time. Since the waiting time for between- 
mode transition is exponentially long, we often run multiple short chains for 
MultiModule instead of one long chain, that is, we apply MultiModule to a 
particular data set multiple times with random initialization. In this way, 
it has a much greater chance to explore different major local modes which 
often correspond to different motif compositions of predicted modules. How- 
ever, we note that our module prediction is quite consistent and that major 
motifs are usually predicted repeatedly in multiple runs (see Section 5.1). 
We employ a heuristic to select representatives of these motifs by ranking 
all the predicted motifs according to a Bayesian score derived in Jensen et 
al. (2004): 



(4) Score = n 



1.5w log(n + 3), 



where j = A, C, G, T, n is the number of predicted sites, 6 is the estimated 
motif weight matrix constructed by predicted sites, 9q is the background 
distribution, and w is the width of the motif. The parameter p (<1) is the 
prior odds of observing a motif site over a background nucleotide. We take 
p = 1/500, which gives good balance between the specificity of a predicted 
motif pattern and the number of predicted sites. Using the average P m of 
each sequence position over these multiple runs and the top K distinct motifs 
ranked by (4), we define combined-predicted modules by the same criterion 
introduced in Section 3.1. 



4. Simulation studies. Transcription factors Oct4, Sox2 and Nanog are 
believed to cooperate in the regulation of genes important to self renewal 
and pluripotency of embryonic stem (ES) cells [Boyer et al. (2005)]. We 
used the following model to simulate data sets in this study: We generated 
20 hypothetical ancestral sequences, each of length 1000 bps. Twenty mod- 
ules, each of 100 bps and containing one binding site of each of the three 
TFs, were randomly placed in these sequences. TFBSs were simulated from 
their known weight matrices with logo plots [Schneider and Stephens (1990)] 
shown in Figure 2. Then based on the choices of the background mutation 
rate pb [with a = 3/3 in equation (3)] and the motif mutation rate pf, we 
generated sequences of three descendant species according to the evolution- 
ary models in Section 2.3. The indel (insertion-deletion) rate was fixed to 
0.1 pb- After the ancestral sequences were removed, each data set finally con- 
tains 60 sequences from three species. Our simulation study was composed 
of two groups of data sets, and in both groups we set Pf = 0-2pb but varied 
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Table 1 
Results for the simulation study 



Oct4 (60) 
N2/N1 



Sox2 
(60) 



N 2 /N! 

51.6/66.0 

52.2/91.3 
42.4/89.6 

22.7/30.6 
34.0/51.1 
14.2/18.2 



Nanog 

(60) 
iVa/iVx 

40.8/46.3 

27.6/37.8 
23.6/39.0 

21.0/33.9 
21.7/31.6 
8.3/14.2 



Three motifs in total (180) 



Sensitivity 


Specificity 


Overall 


73% 


77% 


75% 


60% 


62% 


61% 


49% 


53% 


51% 


35% 


70% 


49% 


36% 


58% 


46% 


15% 


68% 


32% 



(A) 
(B) 
(C) 

(A) 
(B) 
(C) 



38.7/57.4 
27.7/45.3 
22.2/36.7 

18.8/24.8 
9.3/29.4 
5.1/8.1 



NOTE: N2 and Ni refer to the numbers of correct and total predictions for each motif, 
respectively. TF names are followed by the numbers of true sites in parentheses. The upper 
and lower halves refer to the average results over 10 independently generated data sets 
with nt — 0.1 and 0.4, respectively. "Overall" is the geometric average of sensitivity and 
specificity. For each data set, the optimal results (in terms of overall score) among three 
independent runs under the same parameters were used for the calculation of averages. 
Parameter sets (A), (B), (C) are defined as follows: (A) Module mode, L = 100, u = 0.2; 
(B) Motif mode, u = 0.2; (C) Motif mode, u = 0. 



the value of In the first group, we set m, = 0.1 to mimic the case where 
species are evolutionarily close. In the second group, we set //& = 0.4 to study 
the situation for remotely related species. For each group we generated 10 
data sets independently. 

We applied MultiModule to these data sets under three different sets of 
program parameters: (A) Module mode, L = 100, u = 0.2; (B) Motif mode, 
u = 0.2; (C) Motif mode, u = 0. For each set of parameters, we ran Mul- 
tiModule for 2,000 iterations with K = 3, searching both strands of the 
sequences. Initial alignments were built by ordinary HMM-based multiple 
alignment methods. If u = 0, these initial alignments were effectively fixed 
along the iterations. 

The results are summarized in Table 1, which includes the sensitivity, the 
specificity and an overall measurement score of the performance, defined 




FZ W !£• IEr *— 



u 
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(A) 



(B) 



(C) 



Fig. 2. Logo plots for the motifs in the simulated studies: (A) Oct4, (B) Sox2 and (C) 
Nanog. 
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Table 2 

Selected posterior distributions of motif widths for Oct4, Sox2, Nanog and a 

spurious motif 



Width 


Oct4 


Sox2 


Nanog 


Spurious 


6 











0.755 


7 


0.053 


0.859 





0.198 


8 


0.891 


0.141 





0.036 


9 


0.056 





0.159 


0.011 


10 








0.841 





[11,15] 















NOTE: The true motif widths are 8, 7 and 10 for Oct4, Sox2 and Nanog, respec- 
tively (see Figure 2). 



as the geometric average of the sensitivity and specificity. This overall score 
equals zero if either sensitivity or specificity is zero; it equals 1 if both of them 
are 1; when sensitivity = specificity = x, the overall score equals x. These 
properties make it a good overall measurement of predictions. One sees that 
updating alignments improves the performance for both = 0.1 and 0.4, 
and the improvement is more significant for the latter setting [cf. results of 
(B) and (C) in Table 1]. The reason is that the uncertainty in alignments for 
the cases with /x& = 0.4 is higher than that for /i& = 0.1, and thus updating 
alignments, which aims to average over different possible alignments, has 
a greater positive effect. Considering module structure shows an obvious 
improvement for /x& = 0.1, but it is only slightly better than running the 
motif mode for fib = 0.4 [cf. (A) and (B) in Table 1]. We noticed that for 
Hb = 0.4, MultiModule found all the three motifs under both parameter 
settings [(A) and (B)] for five data sets, and the predictions in (A) with an 
overall score of 70% definitely outperformed that in (B) with an overall score 
of 58%. For the other five data sets, no motifs were identified in setting (A), 
but in setting (B) (motif mode) subsets of the motifs were still identified 
for some of the data sets. We suspect that this was caused by the slower 
convergence of MultiModule in setting (A), because of its higher model 
complexity, especially when the species are farther apart. One possible quick 
remedy of this is to use the output from setting (B) as initial values for 
setting (A), which will be a much better starting point for the posterior 
sampling. 

In all the simulation studies, motif widths were updated in the range of 
[6,15] by a Metropolis-Hastings step (Section A. 4). To illustrate the pos- 
terior inference of motif width in MultiModule, we report in Table 2 the 
histograms of the motif widths of the three TFs from a single run of 1000 
Monte Carlo samples after burn-in for one of the simulated data sets with 
Hb = 0.1, where all the three motifs were unambiguously identified. The 
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posterior probabilities were all concentrated on their respective true motif 
widths. On the other hand, we also report in this table the motif width 
distribution when MultiModule output a false motif (i.e., it was none of the 
three true motifs). One sees that, in this case, the motif width was mostly 
sampled as w = 6 and decayed very fast for w > 6. This was due to the fact 
that we restricted the motif width to be between 6 and 15. If we removed 
this restriction, the width would further decrease to smaller values, which 
would be a good indication that this motif might be spurious. 

5. Applications to biological data sets. We tested MultiModule on two 
annotated data sets from mammals and fruit flies. Our computational pre- 
dictions were compared to experimental validations reported previously. 
Detailed comparison with several published motif and module discovery 
methods was conducted based on these data sets. Hereafter we say that 
a predicted site is a correct prediction or that a predicted site overlaps an 
experimentally verified site if the starting position of the predicted site is 
within 3 bps to that of a verified site. This definition is used for assessing 
the performance of all the computational methods mentioned in this article. 
In the following examples, we set u = 0.2 to update ortholog alignments in 
MultiModule. 

5.1. Muscle- specific genes in mammals. Our first test data set is the 24 
skeleton-muscle-specific genes of human and mouse orthologs [Wasserman 
et al. (2000)]. We combined putative dog orthologs based on UCSC genome 
alignment (http://www.genome.ucsc.edu). The muscle-specific expression 
of these genes is controlled by five TFs, MEF, MYF, SP1, SRF and TEF, 
with 16, 25, 21, 14 and 7 experimentally validated binding sites in the human 
genes, respectively [Thompson et al. (2004)]. These binding sites form 24 val- 
idated modules. Here a validated module is defined as a sequence fragment 
containing at least two TFBSs satisfying the condition that the distance 
between any two neighboring sites is < 100 bps. The module boundaries are 
defined by the first and last TFBSs it contains. The total length of these 
modules is 2716 bps, distributed in 21 human genes. Upstream 3 kb of the 
human, mouse and dog orthologs were extracted and aligned by mLAGAN 
[Brudno et al. (2003)]. Based on these alignments, we calculated a conserva- 
tion score (CS) for each sequence position. Using a threshold, we N-masked 
nonconserved bases which are more than 1 kb from the TSSs (transcrip- 
tion start site), that is, all the bases within 1 kb are kept irrespective of 
their CSs. The purpose is to detect promoter and conserved distal enhancer 
elements. Repeats were masked by "N" using a repeat-masking program 
(http://repeatmasker.org/). The preprocessing reduced our data set to 
an average effective length ("N"s are not counted) of 1604 bps per sequence. 
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Fig. 3. Motif and module predictions for the muscle- specific gene set. Logo plots for (A) 
experimentally validated and (B) de novo predicted binding sites for MEF, MYF, SPI 
and SRF in the human genes. (C) The Bayesian inference for the gene TNNI1 in the 
human (top), mouse (middle) and dog (bottom) orthologs. The green curved lines report 
the posterior module probabilities (P m )- Conserved regions (defined as posterior alignment 
probability P a > 0.5,) are indicated by brown bars and dots, of which the vertical heights 
report the corresponding values of P a . Predicted motif sites and experimentally validated 
sites are indicated by vertical lines and bars, respectively. 



We applied MultiModule with K = 5 and L (expected module length) = 
200 bps. Our pilot study suggests that running multiple short chains is more 
efficient for MultiModule (see supplemental notes). Thus, we ran the pro- 
gram 50 times independently and 1000 iterations each run. All the predicted 
motifs were ranked by their Bayesian scores [equation (4)], and the top five 
distinct motifs corresponded to the binding patterns of SRF, MEF, SPI and 
MYF [Figure 3(A) and (B)], and an AC-rich motif which seems to be a repet- 
itive pattern. We note that MultiModule failed to discover any motifs close 
to that of TEF, mainly due to the fact that the TEF sites are not enriched 
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enough in this data set (only seven sites in 24 sequences). For the other four 
known TFs, MultiModule predicted a total of 97 sites in the human genes 
from the top five ranking motifs, and 45 of them overlap corresponding vali- 
dated sites. Thus, it achieves a sensitivity of 59% and a specificity of 46% for 
these four motifs (Table 3). Note that the specificity is likely to be underesti- 
mated since our predictions may include some functional binding sites which 
have not been experimentally validated yet. The 50 runs contained at least 
two distinct modes in module composition, denoted by mode A and mode 
B. In mode A, MultiModule found five motifs corresponding to MEF, MYF, 
SP1, SRF and the AC-rich motif. In mode B, MultiModule found four mo- 
tifs, including MYF, SP1, MEF and the AC-rich motif, with some validated 
SRF sites contained in the predicted MEF sites, that is, one motif was a 
mixture of MEF and SRF. One sees that these two motifs are both AT-rich 
in the middle [Figure 3(A)]. Please note that it is possible that MultiModule 
output fewer motifs than the pre-specified K, when the posterior motif site 
probabilities Pf- of all sequence positions are <0.5 (the posterior probability 
threshold) for some k. We randomly selected one representative from each 
mode and checked their overlaps in base pair level with validated modules. 
Both modes predicted modules at a sensitivity of 65% and a specificity of 
41% approximately (Table 4). We averaged posterior module probabilities 
over the 50 independent runs for each sequence position and used these av- 
erage P m 's to generate our combined predictions with the top five predicted 
motifs. This combined prediction shows a significant improvement in speci- 
ficity without loss of much sensitivity (Table 4). The Bayesian inference by 
marginal posterior probabilities is illustrated in Figure 3(C) using the gene 
TNNI1 as an example, which contains two known modules. One sees that 
(1) high peaks of P m emerge at the two known modules; (2) the posterior 
module and motif probabilities are coupled in conserved regions and thus 
show similar shape among the orthologs. 

We note that the bases in the combined-predicted modules showed a very 
high average P m and the P m values for 82% of them were >0.9 [Figure 4(A)]. 
This implies that the module predictions were quite consistent among inde- 
pendent runs despite the slightly different motif composition. The average 
Pa's of predicted motifs and modules (Tables 3 and 4) were much higher 
than the overall average of all the sequence positions (58%), which indicates 
that functional elements are more likely to be located in aligned blocks. 
We observed that the background mutation rate \Xb was significantly higher 
than that of TFBSs //y [Figure 4(B)]. From the respective definitions of \Xh 
and fj,f, one sees that, for a TFBS, the equivalent mutation rate compa- 
rable to the definition of fib is < (i>f- Thus, the observation that jib > [if 
convinces that even within aligned blocks, TFBSs still show a significantly 
lower evolutionary rate than their surrounding background nucleotides. 
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Table 3 

Predicted motifs in the muscle- specific genes 



Human 



TFs 



Validated Predicted Overlaps 



Mouse 
Pre- 
dicted 

26 

35 
21 
22 
104 



Dog Pre- 
dicted 



Avg. 

Pa 



MEF 

MYF 

SP1 

SRF 

Total 



16 
25 
21 
14 
76 



25 (10*) 
34 
16 
22 
97 



13 (7*) 
13 
8 
11 
45 



24 
31 
16 
22 
93 



9 
97 
70 
98 
93 



NOTE: Tabulated are the numbers of validated TFBSs, predicted sites and overlapping 
sites (i.e., correct predictions) for the human sequences. We also include the total number 
of predicted sites in the mouse and dog sequences. 

'Among the predicted MEF sites, 10 of them overlap the predicted SRF sites, in which 
seven turn out to be experimentally validated SRF binding sites. 



5.2. Early developmental genes in Drosophila. Regulatory regions that 
control early body development in Drosophila melanogaster (Dm, hereafter) 
were identified in previous experimental studies [e.g., Berman et al. (2002)]. 
As a test of MultiModule, we extracted all the identified regulatory regions 
that interact with at least one of the three TFs, Bicoid (Bed), Hunchback 
(Hb) and Kriippel (Kr), which form complex combinatorial patterns that 
regulate early developmental genes. These extracted regions form a data set 
of 26 Dm sequences. We further extracted orthologous regions in Drosophila 
pseudoobscura (Dp, hereafter) based on UCSC genome alignment and ob- 
tained 23 of them. Thus, our full data set contains 49 sequences with an 
average length of 1209 bps. In this data set, 34 functional binding sites in 
Dm with experimental validations are available based on TRANSFAC 9.1 
release [Wingender et al. (2000)]: 12 Bed sites in three sequences, 14 Hb sites 



Table 4 

Predicted modules in the muscle-specific genes 





Predicted 


Overlaps 


Sensitivity 


Specificity 


Avg. 




modules (bps) 


(bps) 


(%) 


(%) 


P a (%) 


Mode A 


4159 


1728 


64 


42 


85 


Mode B 


4412 


1768 


65 


40 


87 


Combined 


2835 


1566 


58 


55 


89 



NOTE: Tabulated are the total numbers of bases in the predicted modules in the human 
gene set with validated modules (the 21 human genes with at least one defined module 
of validated TFBSs). Overlaps report the number of overlaps between the predicted and 
validated modules in base pair level. Sensitivity = (# overlaps) / (# bases in all validated 
modules = 2716 bps). Specificity = (# overlaps) / (# bases in the predicted modules). 
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Fig. 4. (A) Histogram of the average posterior module probabilities (P m ) over the 50 
independent runs of all the bases in the combined-predicted modules. (B) Posterior distri- 
butions of the background and motif mutation rates, fj,i and fif, respectively, calculated by 
pooled samples from the 50 runs. 



in four sequences, and 8 Kr sites in two sequences. The genes with validated 
sites are referred to as validated gene set in this example. 

We applied MultiModule to this data set with K = 3 and L = 200 for 50 
independent runs, each 1000 iterations. We ranked all the predicted motifs 
in multiple runs by the same Bayesian score [equation (4)], and the top two 
distinct motifs correspond to the known Hb and Kr binding patterns [Figure 
5(A) and (B)]. The third motif resembles the known Bed binding pattern 
[Figure 5(C)]. Our predicted motif binding sites overlap substantially with 
validated ones for the three TFs with an overall sensitivity of 44% and 
specificity of 47% (Table 5). Some of our predicted Kr sites turned out to 
be validated Bed sites, which is consistent with the fact that these two 
TFs actually bind to some overlapping functional sites. Such sites are not 
counted as correct predictions for Bed in Table 5. Since the majority of 
the sequences in this data set are not annotated for TFBSs, we include in 
Table 5 the sum of squared distances (SSD) between the predicted and the 
known motif weight matrices as another quality measure of the predictions. 
These SSDs are approximately 1/10 of the expected SSDs between a random 
weight matrix and the known ones for the three TFs. 

We combined all the 50 independent runs to generate our combined pre- 
dictions. In this way, we predicted 70 modules covering 16855 bps (Table 
6): 86%, 49% and 30% of these modules contain the predicted Hb, Kr and 
Bed binding sites, respectively. We used these combined predictions to de- 
fine regulatory interactions among the maternal gene Bed and the four gap 
genes in our data set, Hb, Kr, Gt and Kni, which are the most important 
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Fig. 5. Results of the Drosophila early developmental genes. Logo plots for the known 
(upper panel) and the predicted (lower panel) motifs in Dm for the three TFs: (A) Hb, 
(B) Kr and (C) Bed. The predicted motifs in Dp of these three TFs are identical to those 
in Dm. (D) The predicted regulatory network of the genes Bed, Hb, Kr, Gt and Kni in 
early Drosophila body patterning. An arrow from gene Y to gene X indicates that gene Y 
regulates gene X based on our predicted modules. 



TFs controlling early body patterning in Drosophila. Known regulatory in- 
teractions among these genes are available based on previously reported 
experiments [Sanchez and Thieffry (2001)]. In our simplified analysis, gene 
X is defined to be regulated by gene Y if the regulatory region of gene X 
contains a predicted module composed of at least one binding site for the 
TF encoded by gene Y. Remarkably, all our predicted interactions [Figure 
5(D)] are exactly identical to those known ones, and our analysis recovers all 
the known interactions with Bed, Hb and Kr as regulators. In this data set, 
the predicted motifs and modules also showed higher conservation (Tables 
5 and 6) compared with the overall average P a of 33%. 
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Table 5 

Motifs predicted in the Drosophila early developmental genes 



run uin run uy jr a 

TFs Validated Predicted Overlaps Predicted Predicted (%) 

(SSB) (SSB) 

Hb 14 18 9 156 (0.31) 143 (0.28) 57 

Kr 8 4 4 54 (0.41) 37 (0.38) 53 

Bed 12 10 2 65 (0.36) 60 (0.41) 75 

Total 34 32 15 275 240 56 



NOTE: Predictions in validated gene set and in full gene set are tabulated. SSD is the sum 
of squared distance between a predicted and the corresponding known weight matrices. 
Other fields are defined similarly to those in Table 3. Validated genes in Dm refer to the 
genes with known TFBSs for the corresponding TFs. Exact binding sites are not available 
for the remaining genes although they are known to be controlled by these TFs. 

5.3. Comparing with other methods. We compared the performance of 
MultiModule on the two data sets with that of AlignACE [Roth et al. 
(1998)], CompareProspector [Liu et al. (2004)], EMnEM [Moses, Chiang 
and Eisen (2004)] and CisModule [Zhou and Wong (2004)]. AlignACE finds 
multiple motifs using a repeatedly masking strategy. We ran the program 
under its default setting to search for motifs with different combinations of 
input parameters for motif width and expected number of sites. For these two 
data sets, AlignACE output 40 to 60 motifs for each run, and we repeated 
the program five times for each data set independently, which generated 
around 250 motifs. CompareProspector searches for motifs in one species 
with a given conservation score for each sequence position calculated based 
on a multiple alignment. We ran CompareProspector with each known motif 
width and a total of 300 independent runs for each data set. EMnEM takes 
as input one alignment for each ortholog group with a given phylogenetic 
tree. We input the alignments built by mLAGAN [Brudno et al. (2003)] and 
used a phylogenetic tree with branch length (in the unit of substitution per 



Table 6 

Predicted modules in the Drosophila early developmental genes 





Modules 


Modules 


Pa 


Motifs 


in % of modules 


TFs 


(#) 


(bps) 


(96) 


Hb (%) 


Kr (%) 


Bed (%) 


Dm 


37 


8650 


42 


84 


54 


30 


Dp 


33 


8205 


45 


88 


42 


30 


Total 


70 


16855 


43 


86 


49 


30 



NOTE: Tabulated are the number and total length (in the unit of base pair) of the 
predicted modules, their average alignment probability (P a ) and the percentage of the 
modules containing each detected motif. 



18 



Q. ZHOU AND W. H. WONG 



site) estimated by Xie et al. (2005) from mammalian upstream sequences 
for the first data set. We ran EMnEM with each known motif width w and 
set all the u;-mers as initial consensus. For the above three methods, we 
defined representative output for each known motif by the highest score 
predicted motif (as reported by respective programs) that match the known 
pattern. CisModule is a single-species module discovery method. We ran 
it 50 times independently under the same parameters {K and L) as those 
used in MultiModule. The predicted motifs of CisModule were ranked by 
the same score function [equation (4)]. For all the methods, we used exactly 
the same sequences after preprocessing as described in the previous sections. 

The two test data sets in total contain 117 validated TFBSs for the eight 
known motifs. Table 7 summarizes the performance of each method on these 
data sets with respect to motif identification and site prediction. Multi- 
Module identified more known motifs than all the other methods. It also 
found much more validated TFBSs than the other methods did, with at 
least about 70% of improvement in sensitivity: MultiModule detected 60 
validated TFBSs, while the other methods detected at most 35 validated 
sites. In addition, the specificity of MultiModule is the highest one among 
all the programs for these tests. This indicates that the high sensitivity of 
MultiModule does not come at the expense of specificity. In terms of overall 
performance, MultiModule shows 53% to 81% of improvement compared to 
the other four methods. 

Table 7 

Performance comparison on the two test data sets 



For correctly identified motifs 
# correctly 

identified # of # of Sensitivity Specificity Overall 

Methods motifs predicted overlaps (%) (%) (%) 



ALnACE 6 106 31 26 29 28 

CompPr 6 64 27 23 42 31 

EMnEM 5 102 35 30 34 32 

CisMod 5 110 35 30 32 31 

MltMod 7 129 60 51 47 49 



NOTE: We compare different methods based on their predictions with respect to the vali- 
dated gene sets (i.e., genes with validated TFBSs) which are exactly the same as those used 
in the applications of MultiModule to these data sets. "ALnACE," "CompPr," "EMnEM," 
"CisMod" and "MltMod" refer to AlignACE, CompareProspector, EMnEM, CisModule 
and MultiModule, respectively. We report the number of known motifs each method identi- 
fied. For correctly identified motifs, the numbers of predicted sites and correct predictions 
(# overlaps) are reported. Sensitivity is calculated by (# overlaps) / (total number of 
validated TFBSs = 117). Specificity is calculated by (# overlaps) / (# predicted sites). 
Overall score is defined as the geometric average of sensitivity and specificity. 
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6. Discussion. We have proposed and illustrated a new computational 
approach based on a coupled hidden Markov model for de novo discovery 
of CRMs and motifs in sequences from multiple species. Our simulation and 
test results convey three pieces of information about this approach. First, 
modeling sequence orthology provides more information than using a con- 
servation score or simply pooling sequences from multiple species into a 
heterogeneous data set as illustrated from the comparison with other meth- 
ods. Second, the use of module structure to identify clusters of motif pat- 
terns is usually more powerful than identifying each motif independently. 
Third, updating multiple alignments improves the sensitivity of motif pre- 
diction. From this study, we observe that for species within mammals or 
within Drosophila, aligned regions are usually much longer than the width 
of TFBSs, and TFBSs definitely show lower mutation rates compared with 
aligned background nucleotides as shown in Figure 4(B), which is also true 
for the Drosophila data set. 

Since MultiModule samples from a complicated joint probability distri- 
bution by a Gibbs sampler, the problem of multimodality needs to be con- 
sidered. We observe that it is helpful to integrate out weight matrices and 
other parameters even in an approximate sense (Section A. 6) for the con- 
vergence of MultiModule. To further alleviate the possibility of local traps, 
we combine samples from multiple randomly initialized chains to construct 
superior estimates in practice. An alternative approach to this end is the use 
of more sophisticated Monte Carlo algorithms to handle multimodality, such 
as the parallel tempering [Geyer (1991)] or the equi-energy sampler [Kou, 
Zhou and Wong (2006)]. The computational complexity of MultiModule is 
approximately proportional to 2 N KL, where N is the number of species, 
K is the number of TFs, and L is the total length of sequences, which is 
scalable with a reasonable selection of orthologous species. It is worth men- 
tioning that the complexity of the motif mode of MultiModule is linear in 
N . The use of c-HMM is not restricted to de novo discovery. With given 
PWMs and other parameters, MultiModule can be used to scan for modules 
in ortholog groups. From our experience, for de novo motif finding, Multi- 
Module is suitable to handle sequences with an average length less than 2 
kb. If the average search region is much larger, some preprocessing is needed 
to reduce it to save the computational cost and to increase the motif site 
enrichment. This was the reason why we removed the nonconserved bases 
in the first application of the muscle-specific genes. 

For the current implementation of MultiModule, we assume that the num- 
ber of motifs K is fixed and known. But in real module discovery applica- 
tions, the exact value of K is usually unknown. It would be desirable to 
develop a coherent approach to estimate this number simultaneously. How- 
ever, up to now, we do not have an efficient method to complete this task. 
The dimensionality of the parameter space is determined by K. With all 
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kinds of missing data in our model, it is impossible to integrate out both 
the parameters and the missing data to obtain the marginal distribution of 
K. Thus, we decide to fix K for the current implementation. When K is 
set to be smaller than the true number of motifs, the program usually finds 
subsets of the motifs that are highly enriched (some repetitive patterns may 
be included as well, such as the AC rich pattern in the first application). 
When K is set to be greater than the true value, the program may output 
fewer than K motifs, as we pointed out in Section 5.1. Our suggestion is 
to apply some pilot runs of MultiModule with different K in a reasonable 
range, say, between 2 and 6, and check how many distinct motifs it actually 
outputs. Then one may make a decision on the suitable value of K and per- 
form multiple runs to generate final predictions. In our experience, this ad 
hoc approach is usually acceptable for most applications. 

In this article the module structure is modeled by a one-step Markov chain 
with two states, which specifies a geometric distribution on the length of a 
module. This is definitely a simple approximation to the biological reality, 
which only captures the co-localization of multiple motif sites, that is, the 
model puts some constraints on the possible locations of the motif sites in 
the same module. Our current approach should be viewed as a very first 
step to incorporate module and phylogenetic structures into de novo mo- 
tif discovery. There exists substantial room for further development in the 
direction. One may extend the HMM used in this article to a higher-order 
Markov chain such that the transition probability to a motif depends on 
the previous motif in the module. This allows the model to estimate pos- 
sible synergistic interactions between neighboring motifs. Other refinement 
of the model, such as more comprehensive phylogenetic tree topology and 
more sophisticated background model, is expected to enhance the utility of 
this approach. However, with the increasing model complexity, the computa- 
tional efficiency and robustness of the statistical inference based on posterior 
sampling will be more challenging. 

APPENDIX: THE GIBBS SAMPLER FOR THE C-HMM 

Suppose the data set of interest contains sequences of n genes from N 
species, S\ m) (i = l,2,... ,n,m = 1,2,..., N). Without loss of generality, 
we assume n = 1, since the sampling procedure for the ith ortholog group 
is similar for all i. Thus, we simply denote the input sequences as S = 
[S«,...,SW]. Define 8 = [0 , 1? . . . , 0*], where 9 = [9®, 9^, . . . , 0^} 
denotes the background distributions for the ancestral sequence (9^) and 

the current species (9q ; m = 1,2,..., N), and is the PWM for the kth 
motif (fc = 1,2,..., K). Each background model is an i.i.d. multinomial dis- 
tribution. We estimate 9^ for the current species (m = 1,2,..., N) before 
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the Gibbs sampling iteration, and thus effectively assume that they are given. 
Let W = [wo, ... , wk] be the widths of the background model (too = 1) 
and the motifs (i.e., is a 4 x matrix). The transition probability 
matrix T is defined in (1), and we denote q= [qo, q\, ■ ■ . , qx]- The neutral 
evolution of background nucleotides is characterized by the parameter vector 
(fib = [1 — /ife, a, 2(3] [equation (3)]. The probability of breaking an evolution- 
ary bond between any base of an aligned TFBS and that of its ancestral site 
is fif. Denote the hidden states by Y, which indicate whether the observed 
nucleotides are located in a background (B) or module (M) region. For those 
in a module, the hidden states Y also specify whether they are within-module 
background (Mo) or motif sites (Mi to Mk)- Thus, the hidden states imply 
the locations of modules and motif sites. We further use A to denote the 
multiple alignment of , . . . , , Z to denote the ancestral sequence and 
V to denote the evolutionary bonds of aligned TFBSs. Conceptually, we 
treat A, Y, Z and V as missing data and denote them by D m i s = [A, Y, Z, V] . 
All the parameters are denoted by ^ = [W,@,T,q,<pb, fif]. 

A.l. Prior and posterior distributions. MultiModule takes expected mod- 
ule length L and the number of motifs (TFs) K as input and fixes the transi- 
tion probability from a module state to a background state t = 1/L. We put 
independent Poisson priors with mean A = 10 for motif widths. Flat Dirichlet 
(Beta) priors are prescribed to all the other parameters. More specifically, 
we use a flat product Dirichlet of dimension 4 x Wk as the prior distribution 
for Ofc (k = 1,2, ... ,K) (i.e., the parameter for this product Dirichlet dis- 
tribution is a 4 x Wk matrix with all elements = 1). We put four-, (K + 1)- 

and three-dimensional flat Dirichlet priors on 9q , q and respectively. 
The prior distributions for r and \i$ are both Beta(l, 1). With these prior 
distributions specified, one can write down the joint posterior distribution 
of all the parameters and missing data, 

(5) P(tt, D mis \S) oc P(D mis , 5|*)tt(*), 

where 7r( 1 5 r ) denotes the joint prior distribution. Of interest are the locations 
of motif sites and modules, that is, the hidden states Y. One wants to 
perform inference based on the marginal posterior distribution of Y given 
all the sequence data S, 

(6) P(Y\S)ocf £ Pfy.A^V.S^^d*, 

J *A,z,v 

where all the other unknown variables are marginalized out. However, the 
integral in (6) has no analytical solution, and thus, we devise a Gibbs sam- 
pling approach to generate samples from the joint distribution [equation 
(5)]. Based on these samples, one can easily construct empirical marginal 
posterior distributions of interesting variables (Y in this case). 
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Fig. 6. The map from the alignment of three sequences in (A) to its path representation 
in 3-D space in (B). The path starts from (0, 0, 0), moves along A-B-C-D-E-F-G-H, 
and ends at (300, 300, 290). Shaded areas in (A) represent aligned blocks. Coordinates of 
aligned positions increase simultaneously along the path, such as those of sub-path D from 
(100, 90, 70) to (200, 190, 170). 

Our Gibbs sampler contains three steps: (1) sample A given ^; (2) sample 
\Y,Z,V] jointly given A and \&; (3) sample ^ given \Y,Z,V] and A. To 
simplify the description, we first discuss how to sample from the conditional 
distributions assuming that the alignment A is given. Then we discuss how to 
update A by one more conditional sampling step by a Metropolis-Hastings 
algorithm. We also describe how to integrate out all the parameters VI/ in 
(5) in an approximate manner, so that we are effectively sampling from 
P{D m is\S). We find this implementation of a collapsed Gibbs sampler [Liu 
(1994)] improves the convergence of MultiModule. 

A.2. Path representation of the c-HMM. Any multiple alignment of the 
orthologs S=[S ( - 1 \...,SW] can be viewed as a path in the A-dimensional 
space from (0, 0, ... , 0) to (L\, L2, . . . ,Ljv), with L m the length of S^ m \m = 
1,2, ... , N). The path is composed of a series of A-dimensional points, and 
the coordinates of each point are the last visited positions of these N se- 
quences. Figure 6 shows the map from a multiple sequence alignment to a 
path in 3-D space, which visits all the sequence positions in a unique order. 
This is a natural generalization of the 2-D path representation of a pair- wise 
alignment. 

Suppose the current alignment path A = a±a2 ■ ■ ■ clua)-, where L{A) is the 
total number of points and is the tfth point in this path [d = 1, 2, . . . , 

Denote the coordinates of by C(d) = [cjj , . . . , c^] , especially, C(l) = 
[0, . . . ,0] and C(L(A)) = [L±, . . . ,Ljy]- Define the emission components (EC) 
of citf as the subset of its coordinates which increase compared with the 
corresponding coordinates of a^-\, that is, EC(d) = {m|cA m ^ > cjj^m = 
1,...,N}. For instance, in Figure 6, the emission component of a point in 
sub-path A is the single element set {1}, since along this sub-path only the 
first sequence emits nucleotides, while the emission components of a point in 
sub-path D is {1,2,3}. We further denote the nucleotides emitted at ad by 
X(d) = {x d i\m G EC(d)}, where x d m ^ is the nucleotide at position in 
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S<- m \ Let \EC(d)\ denote the size of the emission components of ad, that is, 
X(d) contains nucleotides from \EC{d) \ species. If \EC(d)\ > 2, the emission 
components are aligned to each other at a^. Let Y(d) = [y^p , ■ ■ ■ , y d N ^] be 

the hidden states of ad, with y d m ^ the hidden state at c d (m = 1,2,..., N). 
The following constraints are implied by the definition of the c-HMM: 

(i) Vd = Vd \ if x< d and x d are aligned (i,j = 1, 2, . . . , N); 

(ii) y d m) = y d % if m^EC(d); 

(iii) y d m) =y d ™\ = -.. = y^ k _ ± = M k , if [c^c^.J is a binding site 
of motif k in S^ m \ 

Constraint (i) says that any aligned nucleotides share the same hidden state 
(the coupled state). Constraint (ii) is due to the fact that if m ^ EC(d), 
which implies that c d = cH, then x d and x^—i actually refer to the 
same nucleotide in S^ m ' and thus have the same hidden state. For example, 
in Figure 6, the third coordinates of the points in sub-path E are identical, 
and they all refer to position 170 of S^ 3 \ Constraint (iii) is consistent with 
the definition that a motif binding site is treated as one state (Mk) in our 
model. We define change points of the path by 

CP(A) = {a d | C(d + 1) - C(d) + C(d) - C(d - 1), 1< d < L(A)}, 

which are the points where the alignment path changes its direction in the 
^-dimensional space. For instance, all the points shown in Figure 6(B) are 
change points except the starting and ending points. By this path repre- 
sentation of an alignment, we effectively augment the hidden state to an 
iV-dimensional vector such that the model reduces to a Markov chain in the 
augmented state space. Then one may use the Markovian property to derive 
recursive algorithms to sample the hidden states exactly given parameter 
values and an alignment. 

A.3. Sample [Y,Z,V] given * and A. Denote Y[i,j) = [Y(i),Y(i + 
1), . . . ,Y(j)] and = [X(i), X(i + 1), . . . , X(j)] for any two points ai 

and a,j (j > i) in the path, especially, y[i,i] =Y(i) and =X(i). De- 

fine 

(7) f d (y)=P(Y(d)=y,X[l,d\\V,A), 

for d = 1, 2, . . . , L(A), where y is an ^-dimensional vector in the augmented 
state space. Let y H = {[y^ , . . . , yW] \ y^ = H, Vm G EC(d)} for H = B 
(background state) or M (module state) , which represents an iV-dimensional 
state vector with H as its components in EC(d). Then we have the following 
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recursive summations to calculate (7): 

(8) f d (y B ) = P(X(d)\e , fa) ]T Tr(y B \y, r, t)f d ^(y), 



K 



fdiVM) = 



k=0 



q k P(X[d-w k + l,d\\e k ,(j, f , 



(9) xJ2Tr(y M \y,r,t)f d - Wk (y) 

y 

where TV(yjj|y,r, t) specifies the transition probability from y to yn, and 
P(X(d)\9o,fa) and P(X[d — w k + 1, d]|6jfc, fJ<f, fa) are the marginal probabil- 
ities of emitting X{d) and X[d — w k + 1, d], given that the current state is 
B and M k (k = 0, 1, . . . , K), respectively. Recall that M k is the decomposed 
state of M, which indicates whether the current hidden state is within- 
module background (Mo) or one of the K motifs (Mi to Mr-). 

Since a TFBS is treated as one state as a whole, no change points are 
allowed in the interval [d — w k + 1, d] for k = 1, . . . , K in the summation of 

(9) . In other words, motif sites are not allowed to be located across any 
change points of the alignment. We calculate the transition probabilities in 
(8) and (9) according to the definition in (2), 

(10) Tr(y H \y,r,t) = \ ]T T(y^,H), for H = B or M, 

where y = [y^ l \ ■ ■ ■ ,y^] and = yjp for j ^ EC(d) due to constraint 

(11) described in the previous section. If ad- Wk is not a change point (k € 
{0, 1, . . . , K}), that is, EC(d) = EC(d — i), i = 1, 2, . . . , w k , equation (10) can 
be simplified as Tr(yij\y,r,t) = T(H',H), where ?/ m ) = H' for m E EC(d). 
For example, consider a point C(d) = [250,240,170] in sub-path E in Fig- 
ure 6, whose EC(d) = {1,2}. Suppose its hidden state is Y(d) = [B,B,M]. 
Since the previous point C(d — 1) = [249,239, 170] is in the same sub-path 
(not a change point), we know that the hidden state Y(d — 1) is of the form 
[y, y, M] with y = B or M, and 

Tr(Y(d) = [B, B, M] \Y(d - 1) = [y, y, M], r, t) = T(y, B). 

If Y(d) = [M k ,M k , M] (k = 0, 1, . . . , K), we will consider the point (d - w k ) 
in a similar way given that {d — w k ) £ sub-path E. 

If \EC(d)\ > 2, the current point a d emits a group of aligned nucleotides. 
In order to calculate the marginal emission probabilities, one needs to sum 
over all possible ancestral nucleotides. More specifically, 



(ii) p(x(d)\e ,fa) = Y, 



meEC(d) 
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where 9q (z) is the probability of observing nucleotide z under the ancestral 
background model, and &b is the neutral substitution matrix defined in (3). 
For k = 1,2 K, 



P(X[d-w k + l,d\\e k ,(i f ,<M 

(12) 

•uik 



n J2 @ ki{zi) n p(x { ^i k+i \zi,fx f ,e ki ) 

i=l L Zi m£EC(d) 



where Q k i is the weight vector at the ith position of motif k, that is, the 

M I 



ith. column of and P(xJ- w +i \zi,[if,@ki) is the probability that the 



P{x { ™ ] +i \Zi^f,® ki ) 



ancestral nucleotide Zi evolves to the descendant nucleotide x^j w +i under 
the evolutionary model for TFBSs, 

d—Wk+i 

(13) 

= (i - fi f ) ■ i(4_L fe+ i = zi) + M/®fci(4-l fe+ i)- 

For k = 0, equation (12) reduces to equation (11) for within-module back- 
ground. If \EC(d) \ = 1, the calculation reduces to single-species situations, 
such as in CisModule [Zhou and Wong (2004)]. 

Using the recursions of equations (8) and (9) along the alignment path A 
for d = 1, 2, . . . , L{A), one can calculate the marginal probability of observ- 
ing all the sequences S = [S^\ . . . , 5^] given the current parameters and 
alignment, that is, 

(14) P(S\% A) = P(X[1,L(A)]\% A)=Y, h(y), 

y 

where all the hidden states Y, ancestral nucleotides Z and evolutionary 
bounds V are summed over. 

Based on the partial summations calculated in equations (8) and (9), 
we sample [Y,Z,V] sequentially from d = L(A) to d= 1. Suppose we have 
sampled the hidden state of a^+i as Y(d + 1) = ■ • ■ , Vd+i] > where c^™{ 

is either a background nucleotide or the start position of a motif site in S( m ' 
for m = 1, . . . ,N. We sample the hidden state Y(d) = [y d , ■ ■ ■ ,y d ] of ad 
with probability proportional to fd(Y(d))Tr(Y(d + l)\Y(d),r,t) subject to 
constraints (i) and (ii) of the c-HMM. If the emission components of ad are 
sampled as background, we move to (d— 1) and repeat to sample Y(d— 1). If 
the emission components of ad are sampled as state M, we further sample the 
motif type (i.e., the decomposed states Mo, Mi, . . . , Mk) with probabilities 
proportional to the K+ 1 terms of fd(Y(d)) in (9) for k € {0, 1, . . . , K}. Given 
the imputed value of Y(d), we set Y(d — Wk + i) = Y{d) for i = 1, . . . ,w k — 1 
following constraints (hi) and (ii). Then we move to the point (d — w^) and 



26 



Q. ZHOU AND W. H. WONG 



repeat the sampling procedure for Y(d — w^). If \EC{d) \ > 2, we also sample 
associated ancestral nucleotides and evolutionary bonds according to the 
calculations in equation (11) to equation (13). Suppose the current emission 
components are background nucleotides (B or Mo). We sample the ancestral 
nucleotide Z d with probability 

p{z d = z)^e < i\z) J] <M*,4 m) )> 

m£EC(d) 

for z G {^4, C, G, T}. If the current emission components are binding sites of 
motif k, we sample each base of the ancestral binding site Zd-w k +i " " • %d 
independently with probabilities 

P{Z d - Wk+i = z) OC ® ki (z) Y[ ^ri-lfe+ik'M/' ^)) 

m&EC{d) 

for z G {A,C,G,T} and i = 1,2, . . . ,w k , where P{x^ Wk+i \z, Hf, Q ki ) is given 
by (13). Given the ancestral binding site, we update the evolutionary bond 
between the ancestral and current binding sites for each base independently. 
Denote the evolutionary bond between X£_ Wk+i [m £ EC(d)] and Z d _ Wk+i 
by . We connect the bond with probability 

p{y(m) = 1) = U ~ M/) • 1 ( X Ll fc +» = Z d-w k +i) _ 

otherwise the bond will be broken. 

A. 4. Sample ^ given [Y, V] and A. Let us return to the graphical 
representation of the c-HMM as illustrated in Figure 1(B). In this condi- 
tional sampling step, the hidden states, the ancestral nucleotides of cou- 
pled nodes and the evolutionary bonds associated with aligned TFBSs are 
given. We sample r from Beta{Csu + 1,Cbb + 1), where Cbm and Cbb 
are the numbers of transitions from B to M and from B to B, respec- 
tively. Denote the numbers of states B, M and M k by |£?|, \M\ and \M k \ for 
k = 0, 1, . . . ,K, respectively (|M| = J2 k= o\M k \). The conditional posterior 
distribution of q = [qo,qi, ■ ■ ■ ,qx\ is Dw(\Mq\ + 1, . . . , \Mk\ + 1)- We update 
the ancestral background distribution 9^ by Dir(C^ + 1), where is 
the count vector of the imputed ancestral background nucleotides and 1 is a 
vector of l's of length 4. Denote by Ci,C s and C v the numbers of identities, 
transitions and transversions from the ancestral to the current aligned back- 
ground nucleotides, respectively. Then we sample 4>b = [1 — fJ-b, a, 2(3] from 
Dir(Ci + 1, C s + 1, C v + 1). For all the aligned TFBSs with their imputed an- 
cestors, we count the numbers of broken and connected evolutionary bonds, 
| Vb| and |Vi|, respectively, and sample from Beta(\Vo\ + 1, | Vi | + 1). The 
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sufficient statistic for Q k i (k = 1, . . . , K; i = 1, . . . , Wk) has three components, 
(1) the count vector of unaligned current sites, C ki , (2) the count vector of 
ancestral sites, C ki , and (3) the count vector of aligned descendant sites with 
a broken evolutionary bond, C ki , since each of them is an independent sam- 
ple from Ofcj under our model. Then the conditional posterior distribution 
of e ki is Dir{C ki + 1) , where C ki = C 9 ki + C z ki + C b ki . 

A Metropolis-Hastings step is implemented to update motif widths. We 
illustrate our method by one motif as an example. Given the current width 
w and all sampled sites of this motif, we propose to increase or decrease 
one base at their left or right ends with equal probability. After choosing 
one of the four possibilities, the problem is equivalent to a model selection 
problem: The nucleotides observed in the selected positions are generated 
from the background (Hq) or from a motif column (Hi). If H\ is true, de- 
note the weight vector of the motif column by Q w and its sufficient statistic 
by C w calculated as described in the previous paragraph for any @ ki . Be- 
fore calculating C w , one needs to sample the associated evolutionary bonds 
V w , with \Vq"\ and \V]"\ denoting the numbers of broken and connected 
bonds. Under Hq, we denote by C b = [c^ , c[ X \..., c£ ] the count vectors 
of ancestral nucleotides (c[°^) and current unaligned nucleotides of different 

orthologs (c^ , m = 1, ...,7V) in the selected positions. Denote by Ct the 
substitution count matrix from an ancestral base to its descendant bases. 
Then we calculate the posterior odds of H\ with V w over Hq by 

P(H \9 ,<5> b ,Y,Z,A,S) 

, , <H Q ) P(C b ,C t \d ,^ b ,H ) 

(15) 

TrQgi) M/'^'q - m/) 1 ^ 1 J(e„) c -r(4) de w 
_ Trjih) ^1^1(1 - /vgiK + i)r(4) 

where we define R c = Uij ^ and T(R) = Uij ^(Rij) with R and C being 
matrices (vectors) of same size, \C W \ is the total counts in C w , and the ratio of 
tt(Hi) over tt(Hq) is determined by the Poisson prior for the motif width. In 
each iteration, we always propose to flip the two hypotheses. The proposal 
from Hq to Hi involves proposing evolutionary bonds for all the aligned 
nucleotides that are identical to their ancestors. We propose to break each 
of these bonds independently with probability [if. Under these proposals, 
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the Metropolis-Hastings ratio is 

P(H 1 ,V w \n f ,Y,Z,A,S) Q(H Q \Hi,V w ) 



Rmh 

(16) 



P(H \e ,$ b ,Y,Z,A,S) Q(H U V™\H ) 

P(H u V w \n f ,Y,Z,A,S) 1 

P(H \9o, $ 6; Y, Z, A, S) - vr\ 

where n s is the number of aligned nucleotides that are different from their 
ancestors, and Q stands for the proposal probabilities. Prom the definition 
of an evolutionary bond, we always have n s < |Vq"|. 

A. 5. Sample A given To consider the uncertainty in multiple align- 
ments, we update A by its marginal posterior distribution given the cur- 
rent parameters, that is, we want to sample from P(A\^, S). A Metropolis- 
Hastings step is implemented for this conditional sampling step. Suppose 
the current alignment is A. We propose a new alignment A* from an ordi- 
nary multiple sequence alignment procedure based on an HMM [Baldi et al. 
(1994), Krogh et al. (1994)], denoted by MA-HMM so as to distinguish from 
c-HMM. In MA-HMM, each sequence is aligned to a profile (or a hidden 
sequence template) based on an HMM with three states, aligned, insertion 
and deletion. In our proposal, the transition matrix between the three states 
is fixed by prior expectations as 

"0.998 0.001 0.001" 
D= 0.998 0.002 
.0.025 0.025 0.95 . 

where states are ordered as deletion, insertion and aligned. The rationale 
for selecting these values is that we expect to start an aligned block every 
500 bps (Dn = D22 = 0.998) and that the average length of an aligned block 
is 20 bps (D33 = 0.95). These values serve as the default parameters for all 
the results presented in this article. We use the current neutral substitution 
matrix as the emission probabilities from the profile (ancestral sequence 
Z) to a current aligned nucleotide. Ancestral and unaligned nucleotides are 

emitted from their respective background models 6^ , 9^ , . . . , 9q N ^ . Denote 
the probability of proposing the alignment A* by Q(A*\9q, & b , Z, S) under 
the MA-HMM. Then the Metropolis-Hastings ratio is 

P(A*\%S) Q(A\e ,$ b ,Z,S) _ P(S\A*,V) Q(S,Z\A,d ,$ b ) 
[ ' P(A\%S) Q(A*\e ,$ b ,Z,S) P(S\A,V) Q(S,Z\A*,d ,$ b y 

where P(S\A,^) and P(S\A* ,^) are calculated by (14) through the recur- 
sive forward summations. Q(S, Z\A, 0q, 3>&) and Q(S, Z\A*, 9q, $5) are the 
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probabilities of observing S and Z given alignments A and A* under the 
MA-HMM, respectively, that is, 

(is) Q(s,z\A,e ,$ b )= Uie^T" ($&) Ct , 

where c£ and Ct are defined similarly to those in (15) but for all the 
sequence positions. Note that the prior probabilities of an alignment are 
identical in both c-HMM and MA-HMM, and thus are canceled at the R.H.S. 
of equation (17). 

A. 6. A collapsed sampler. Suppose the data set contains n genes. Denote 
the missing data, including Y, Z, A and V, for the ith gene by -D m is,i, and let 

Si = {»sj m ^}m=i be the orthologous sequences of the ith gene, i = 1, 2, . . . ,n. 
Since collapsing random components in the Gibbs sampler usually results in 
more efficient sampling schemes [Liu, Wong and Kong (1994), Liu (1994)], we 
implement a collapsed MultiModule to sample from P(D m ; Si i, . . . , D m \ s ^ n \S) 
by iteratively scanning each gene. For the ith gene, given the current imputed 
values of 

^mis,[— i] — \-^mis,l > • ■ • j -^mis.i— 1 > ^mis,i+l j • • • j -^mis,n} > 

we first estimate the parameters by their conditional posterior means, 

(19) * H] =£(*|Z) m is,H],SH]), 

where SVji = {Si, . . . , 5j_i, Sj+i, . . . , S n }. Then we sample the missing data 
for the ith gene from 

(20) D^~Pp miM |# H] ,,%). 

Equation (19) can be easily calculated from the conditional posterior distri- 
butions of different parameters (Section A. 4), and equation (20) is exactly 
the same sampling procedure as described in Sections A. 3 and A. 5 taking 
as the current parameters. We want to emphasize that, although this 
is only an approximate way to collapse all the parameters, it indeed im- 
proves the convergence of the Gibbs sampler. The simulation studies were 
performed with this collapsed version of MultiModule. 
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